Unsupervised Learning Model


Data Processing

Text Preprocessing: Combine Job Title and Skills into a Single Field for TF-IDF

# Normalize text fields: fill missing values, cast to string, and lowercase
df['TITLE_CLEAN'] = df['TITLE_CLEAN'].fillna('unknown').astype(str).str.strip().str.lower()
df['SOFTWARE_SKILLS_NAME'] = df['SOFTWARE_SKILLS_NAME'].fillna('').astype(str).str.lower()
df['SPECIALIZED_SKILLS_NAME'] = df['SPECIALIZED_SKILLS_NAME'].fillna('').astype(str).str.lower()

# Combine text fields for TF-IDF
df['combined_text'] = df['TITLE_CLEAN'] + ' ' + df['SOFTWARE_SKILLS_NAME'] + ' ' + df['SPECIALIZED_SKILLS_NAME']
  • We combined the job title and skills into a single text field (combined_text) to create a richer, unified input for the TF-IDF vectorizer. This improves the quality of feature extraction by capturing more context about each job, enabling better clustering and analysis.

Unique Value Counts in Job Titles and Skill Fields

Unique values in 'TITLE_CLEAN': 27266
Unique values in 'SOFTWARE_SKILLS_NAME': 22456
Unique values in 'SPECIALIZED_SKILLS_NAME': 41462
  • Counting unique values helps assess the diversity and granularity of job titles and skill mentions before vectorization or clustering, which matters for understanding feature richness and potential noise in the dataset.
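
The counts above could be produced with something like the following. This is a minimal sketch on a toy DataFrame; the rows are hypothetical stand-ins, and only the column names come from this report.

```python
import pandas as pd

# Toy stand-in for the job postings DataFrame (hypothetical rows).
df = pd.DataFrame({
    "TITLE_CLEAN": ["data analyst", "data analyst", "erp consultant", None],
    "SOFTWARE_SKILLS_NAME": ["sql, python", "sql, tableau", "sap", "sap"],
    "SPECIALIZED_SKILLS_NAME": ["statistics", "dashboards", "erp", "erp"],
})

text_columns = ["TITLE_CLEAN", "SOFTWARE_SKILLS_NAME", "SPECIALIZED_SKILLS_NAME"]

# nunique() ignores missing values by default (dropna=True).
unique_counts = {col: df[col].nunique() for col in text_columns}

for col, count in unique_counts.items():
    print(f"Unique values in '{col}': {count}")
```

Note that each distinct skills string counts as one value here, so two postings listing the same skills in a different order are counted separately.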

NLP + K Means Clustering


Text Vectorization and Feature Scaling for Clustering


#| eval: true
#| echo: false

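from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Vectorize: keep the 1,000 most frequent terms, dropping English stop words
tfidf = TfidfVectorizer(max_features=1000, stop_words='english')
X_tfidf = tfidf.fit_transform(df['combined_text']).toarray()

# Scale each TF-IDF feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_tfidf)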
  • We performed text vectorization and feature scaling, which are essential preprocessing steps before clustering.

  • TfidfVectorizer converts the cleaned job and skills text (combined_text) into a numeric matrix based on word importance (TF-IDF), enabling text-based clustering.

  • StandardScaler scales the TF-IDF features to have zero mean and unit variance, which is important because KMeans is sensitive to feature magnitudes.

KMeans Clustering and Evaluation with NAICS 6-Digit Labels

Adjusted Rand Index (NAICS_2022_6_NAME): 0.009
Normalized Mutual Info Score (NAICS_2022_6_NAME): 0.033
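
The cell that produced these scores is not shown. A minimal sketch of how the clustering and scoring might look follows. The toy rows are hypothetical, k=3 follows the pipeline summary later in this report, and `n_init` and `random_state` are assumptions added for reproducibility.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.preprocessing import StandardScaler

# Toy postings with a hypothetical reference label (one value missing).
df = pd.DataFrame({
    "combined_text": [
        "data analyst sql python tableau",
        "bi analyst sql power bi dashboard",
        "sap consultant erp oracle",
        "erp consultant sap planning",
        "support analyst windows vmware desktop",
        "infrastructure analyst windows vmware",
    ],
    "NAICS_2022_6_NAME": ["A", "A", "B", "B", "C", None],
})

X = TfidfVectorizer(stop_words="english").fit_transform(df["combined_text"]).toarray()
X_scaled = StandardScaler().fit_transform(X)

# k=3 follows the report; n_init and random_state are assumptions.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df["cluster"] = kmeans.fit_predict(X_scaled)

# Score only the rows that have a reference label.
df_eval = df[["NAICS_2022_6_NAME", "cluster"]].dropna()
ari = adjusted_rand_score(df_eval["NAICS_2022_6_NAME"], df_eval["cluster"])
nmi = normalized_mutual_info_score(df_eval["NAICS_2022_6_NAME"], df_eval["cluster"])

print(f"Adjusted Rand Index (NAICS_2022_6_NAME): {ari:.3f}")
print(f"Normalized Mutual Info Score (NAICS_2022_6_NAME): {nmi:.3f}")
```

Both metrics are invariant to how cluster IDs are numbered, so no mapping from cluster number to label is needed before scoring.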

Evaluate Clustering Using Multiple Reference Labels (NAICS, SOC, ONET)


#| eval: true
#| echo: false

import pandas as pd
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

reference_labels = ['NAICS_2022_6_NAME', 'SOC_2021_5_NAME', 'ONET_NAME']
results = []

for label in reference_labels:
    # Compare only rows where the reference label is present
    df_eval = df[[label, 'cluster']].dropna()
    ari = adjusted_rand_score(df_eval[label], df_eval['cluster'])
    nmi = normalized_mutual_info_score(df_eval[label], df_eval['cluster'])
    results.append({'Reference Label': label, 'ARI': ari, 'NMI': nmi})

# Collect the scores into a table
results_df = pd.DataFrame(results)
print(results_df.round(4))

     Reference Label     ARI     NMI
0  NAICS_2022_6_NAME  0.0092  0.0331
1    SOC_2021_5_NAME  0.0000  0.0000
2          ONET_NAME  0.0000  0.0000
  • NAICS_2022_6_NAME has the highest agreement with clusters (though still very low), suggesting a slight alignment with industry-based classification.

  • SOC and ONET labels score 0.0000 on both metrics, meaning the clusters derived from TF-IDF features of job titles and skills show no measurable correspondence with these occupation-based taxonomies.
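
Scores of exactly 0.0000 on both metrics are worth a sanity check. The report does not say why they occur. One possibility (an assumption, not something confirmed here) is that a reference column holds a single distinct value in the evaluated rows, in which case ARI and NMI come out as zero regardless of the clustering. The sketch below illustrates that behavior on hypothetical labels.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

clusters = [0, 0, 1, 1, 2, 2]

# Hypothetical reference column with a single distinct value.
constant_label = pd.Series(["Computer Occupations"] * 6)
# Hypothetical reference column that matches the clusters up to renaming.
matching_label = pd.Series(["x", "x", "y", "y", "z", "z"])

ari_constant = adjusted_rand_score(constant_label, clusters)
nmi_constant = normalized_mutual_info_score(constant_label, clusters)
ari_matching = adjusted_rand_score(matching_label, clusters)
nmi_matching = normalized_mutual_info_score(matching_label, clusters)

# Quick diagnostic: how many distinct values does each label column have?
n_distinct = {
    "constant": constant_label.nunique(),
    "matching": matching_label.nunique(),
}

print(f"constant label: ARI={ari_constant:.4f}, NMI={nmi_constant:.4f}")
print(f"matching label: ARI={ari_matching:.4f}, NMI={nmi_matching:.4f}")
print(n_distinct)
```

Running `nunique()` on `SOC_2021_5_NAME` and `ONET_NAME` after the `dropna()` step would show whether this explanation applies to the actual data.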

Visualize TF-IDF-Based Clusters with PCA and Plotly

Interpretation:

  • The three clusters (color-coded) are distinct in the PCA space, suggesting that the clustering algorithm was able to differentiate based on text patterns.

  • This model is capturing textual similarity (e.g., shared tools, terms, or phrasing in job titles and skill lists), not necessarily formal job classifications.

  • Cluster boundaries are data-driven, not taxonomy-aligned.
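
The plotting cell is not shown. A minimal sketch of the projection step, assuming a two-component PCA on the scaled features, might look like this. The synthetic matrix is a hypothetical stand-in for `X_scaled`, and the column names `pc1` and `pc2` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for X_scaled: 30 rows, 10 features, three shifted groups.
cluster = np.repeat([0, 1, 2], 10)
X_scaled = rng.normal(size=(30, 10)) + cluster[:, None] * 3.0

# Project onto the first two principal components.
pca = PCA(n_components=2, random_state=42)
coords = pca.fit_transform(X_scaled)

plot_df = pd.DataFrame({
    "pc1": coords[:, 0],
    "pc2": coords[:, 1],
    "cluster": cluster.astype(str),  # string labels give discrete colors
})

# Share of total variance captured by each plotted component.
print(pca.explained_variance_ratio_)
```

The resulting frame can be passed to `plotly.express.scatter(plot_df, x="pc1", y="pc2", color="cluster")`. With around 1,000 TF-IDF features, two components may capture only a small share of the variance, so it is worth reporting `explained_variance_ratio_` alongside the plot.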


Top Terms Representing Each Cluster (TF-IDF Feature Importance)


 Cluster 0:
pmi, apple, institute, ios, android, vmware, desktop, methodology, expectation, zachman, windows, infrastructure, capability, operating, subcontracting

 Cluster 1:
data, language, programming, sql, intelligence, python, tableau, analysis, dashboard, bi, power, statistics, visualization, analyst, analytics

 Cluster 2:
sap, enterprise, consultant, applications, oracle, functional, management, planning, cloud, architect, architecture, solution, design, erp, resource

Nomenclature of our Clusters

  • Cluster 0 = “IT Infrastructure & Support”

  • Cluster 1 = “Data Analytics & BI”

  • Cluster 2 = “Enterprise Applications & Consulting”


Visualizing Representative Job Titles Across Clusters

Cluster 0 (“IT Infrastructure & Support”)

→ Jobs like enterprise support analyst, senior IT analyst, data integration analyst, IT enterprise architect.

→ These titles focus on support, IT system maintenance, integration, and architecture.

Cluster 1 (“Data Analytics & BI”)

→ Jobs like sr BI analyst, data analyst, data scientist, data research analyst.

→ Heavy emphasis on analytics, business intelligence (BI), and data science skills, consistent with the cluster label.

Cluster 2 (“Enterprise Applications & Consulting”)

→ Jobs like SAP BTP consultant, ERP integrations analyst, applications consultant, product architect.

→ Clearly related to enterprise software (SAP, ERP) and consulting roles.
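
Representative titles can be pulled in several ways. The sketch below takes the most frequent titles per cluster, which is an assumption about the selection rule rather than the method confirmed by this report. The rows are hypothetical.

```python
import pandas as pd

# Hypothetical postings with assigned clusters.
df = pd.DataFrame({
    "TITLE_CLEAN": [
        "data analyst", "data analyst", "data scientist",
        "sap consultant", "sap consultant", "erp analyst",
        "it analyst",
    ],
    "cluster": [1, 1, 1, 2, 2, 2, 0],
})

top_k = 2

# Most frequent titles in each cluster, most common first.
top_titles = {
    cluster_id: group["TITLE_CLEAN"].value_counts().head(top_k).index.tolist()
    for cluster_id, group in df.groupby("cluster")
}

for cluster_id, titles in top_titles.items():
    print(f"Cluster {cluster_id}: {', '.join(titles)}")
```

Frequency-based selection favors generic titles; sampling the postings closest to each centroid is an alternative when more specific examples are wanted.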


Preprocessing Software Skills for Analysis


Clustered Software Skill Visualization

Interpretation

Cluster 0 (“IT Infrastructure & Support”)

Common skills: Microsoft Excel, Microsoft SharePoint, Docusign, SAP Applications, TOGAF, automated cost tools.

→ These tools are typical for IT operations, documentation, system architecture support.

Cluster 1 (“Data Analytics & BI”)

Common skills: Python, SQL (PL/SQL), Looker, Tableau, Power BI, Google Analytics, Qlik Sense.

→ Clear focus on analytics, data visualization, and programming languages.

Cluster 2 (“Enterprise Applications & Consulting”)

Common skills: SAP Sales and Distribution, Google Cloud Platform (GCP), Microsoft OneNote, IBM Maximo.

→ These are enterprise-level software systems for consulting, ERP, and large infrastructure projects.

Conclusion:

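The software skills distribution is broadly consistent with the cluster themes previously assigned from job titles and top terms, although some general-purpose tools (for example, Microsoft OneNote in Cluster 2) appear outside their expected theme.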
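
The preprocessing behind the skill chart is not shown. A minimal sketch of one way to count skills per cluster follows. It assumes each `SOFTWARE_SKILLS_NAME` value is a comma-separated string, which may not match the real column format, and the rows are hypothetical.

```python
import pandas as pd

# Hypothetical postings; skills assumed to be comma-separated strings.
df = pd.DataFrame({
    "SOFTWARE_SKILLS_NAME": [
        "python, sql, tableau",
        "sql, power bi",
        "sap applications, microsoft excel",
        "",
    ],
    "cluster": [1, 1, 0, 2],
})

# One row per (posting, skill).
skills = df.assign(skill=df["SOFTWARE_SKILLS_NAME"].str.split(",")).explode("skill")
skills["skill"] = skills["skill"].str.strip()
skills = skills[skills["skill"] != ""]  # drop postings with no skills listed

# Count how often each skill appears within each cluster.
skill_counts = (
    skills.groupby(["cluster", "skill"])
    .size()
    .rename("count")
    .reset_index()
    .sort_values(["cluster", "count"], ascending=[True, False])
)

print(skill_counts)
```

Taking the first few rows per cluster from `skill_counts` gives the "common skills" lists used in the interpretation.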


Average salary per cluster

Average salary in Cluster 0: $132,148.75
Average salary in Cluster 1: $105,813.00
Average salary in Cluster 2: $127,433.09
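
Averages like these can be computed with a groupby. In the sketch below, the column name `SALARY` is an assumption (the report does not name its salary column) and the values are hypothetical.

```python
import pandas as pd

# Hypothetical salaries; one posting has no salary listed.
df = pd.DataFrame({
    "SALARY": [120000, 140000, 90000, 110000, None, 130000],
    "cluster": [0, 0, 1, 1, 1, 2],
})

# mean() skips missing salaries by default.
avg_salary = df.groupby("cluster")["SALARY"].mean()

for cluster_id, value in avg_salary.items():
    print(f"Average salary in Cluster {cluster_id}: ${value:,.2f}")
```

Because postings without a salary are skipped, each average describes only the subset of postings that report one.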

Salary Distribution Across Clusters

Interpretation for the Boxplot

The boxplot shows the salary distribution across the three clusters derived from unsupervised KMeans clustering on job titles and skills:

Cluster 0:

  • Has a relatively high median salary (~$125K) and moderate spread, suggesting roles with consistent mid-to-high pay (e.g., IT infrastructure and architecture roles).

Cluster 1:

  • Has the lowest median salary (~$95K) with many outliers, indicating entry-to-mid-level roles with high variance (e.g., data or analyst roles).

Cluster 2:

  • Shows the widest salary range with the highest outliers (up to $500K), implying this cluster contains senior or highly specialized roles (e.g., consultants or architects).

Overall, the plot reflects meaningful salary differences between the clusters, supporting the relevance of clustering for job role segmentation.
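
The quantities read off a boxplot (median, spread, outliers) can also be computed directly. This sketch uses the conventional 1.5 × IQR fences and hypothetical salaries. The column name `SALARY` and the helper `boxplot_stats` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical salaries for two clusters.
df = pd.DataFrame({
    "SALARY": [100000, 110000, 120000, 130000, 500000,
               80000, 90000, 100000, 110000, 120000],
    "cluster": [2] * 5 + [1] * 5,
})


def boxplot_stats(s: pd.Series) -> dict:
    """Median, IQR, and count of points outside the 1.5 * IQR fences."""
    q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = int(((s < lower) | (s > upper)).sum())
    return {"median": median, "iqr": iqr, "n_outliers": n_outliers}


# One row of statistics per cluster.
summary = pd.DataFrame(
    {cluster_id: boxplot_stats(group["SALARY"]) for cluster_id, group in df.groupby("cluster")}
).T

print(summary)
```

Reporting these numbers next to the plot makes claims such as "lowest median" or "most outliers" checkable without reading values off the figure.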


Conclusion and Key Takeaways

We applied KMeans clustering to job postings using text data from job titles and associated skills (software + specialized). Despite very low alignment with external classification labels such as NAICS, SOC, and ONET (as shown by ARI and NMI scores), our analysis still uncovered distinct, interpretable clusters with practical insights.

Clustering Pipeline Summary

  • Text Preprocessing: Combined job title, software, and specialized skills into a combined_text field.

  • Vectorization: Used TfidfVectorizer to convert text to numerical features.

  • Scaling: Applied StandardScaler to normalize TF-IDF vectors.

  • Clustering: Ran KMeans with k=3 clusters.
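
The four steps above can be sketched as a single scikit-learn pipeline. This is an illustrative arrangement, not the structure used in this report. The texts are hypothetical, `to_dense` is a helper introduced here, and `n_init` and `random_state` are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Hypothetical combined_text values.
texts = [
    "data analyst sql python tableau",
    "bi analyst sql power bi dashboard",
    "sap consultant erp oracle",
    "erp consultant sap planning",
    "support analyst windows vmware desktop",
    "infrastructure analyst windows vmware",
]


def to_dense(matrix):
    """Convert the sparse TF-IDF matrix to a dense array for StandardScaler."""
    return matrix.toarray()


pipeline = make_pipeline(
    TfidfVectorizer(max_features=1000, stop_words="english"),
    FunctionTransformer(to_dense, accept_sparse=True),
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=42),
)

# Fit every step and return one cluster ID per text.
labels = pipeline.fit_predict(texts)
print(labels)
```

The densifying step mirrors the `.toarray()` call in the vectorization cell. StandardScaler cannot center a sparse matrix, so the alternative is `StandardScaler(with_mean=False)`, which changes the scaling.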

Evaluation:

  • ARI (Adjusted Rand Index): Max ~0.009 with NAICS_2022_6_NAME

  • NMI (Normalized Mutual Info): Max ~0.033

Interpretation: Clusters do not align well with predefined industry/occupation codes. This is not unusual in unsupervised learning, since the clusters are driven by text similarity rather than by the taxonomy labels.

Key Visual Insights

1. PCA Projection

The PCA plot showed visible separation between clusters, indicating that the clustering algorithm found structural patterns in the job title and skills text.

2. Top Terms per Cluster

  • Cluster 0: Keywords like apple, ios, vmware, infrastructure suggest tech roles focused on devices, systems, and IT frameworks.

  • Cluster 1: Terms like sql, tableau, python, analysis indicate data-related roles (analysts, BI, data scientists).

  • Cluster 2: Words like sap, oracle, consultant, planning suggest enterprise solutions, consultants, or ERP specialists.

3. Sample Job Titles by Cluster

Confirms term-based interpretations:

  • Cluster 0: IT infrastructure & support

  • Cluster 1: Data analysts and BI roles

  • Cluster 2: SAP/ERP consultants and architects

4. Software Skills by Cluster

  • Cluster 0: Excel, SharePoint, PowerPoint – general office + support tools

  • Cluster 1: Python, Tableau, Power BI – analytics & data tools

  • Cluster 2: SAP, Oracle, GCP – enterprise software and cloud tools

5. Salary Distribution by Cluster

  • Cluster 0: Relatively high median (~$125K), moderate spread – stable IT roles

  • Cluster 1: Lowest median (~$95K), many outliers – junior-to-mid data roles

  • Cluster 2: Widest range with extreme outliers (up to $500K) – senior consultants & architects

Final Takeaways

Even with low overlap to government taxonomies (NAICS/SOC/ONET), clustering successfully revealed latent role patterns based on real-world skills and job titles.

Unsupervised clustering can meaningfully group job postings by functional role, skill stack, and salary range, offering powerful segmentation for:

  • Career recommendation systems

  • Skill gap analysis

  • Compensation benchmarking

  • Targeted recruitment strategies